Abstract

The research literature on short-term instruction (STI) and intermediate-
term instruction (ITI) for the SAT-mathematical sections and SAT-verbal
sections was reviewed. Selected studies of STI and ITI for tests other
than the SAT-M and SAT-V, and of testwiseness (TW), were included in the
survey if they were judged relevant to the question of special instruction
for the SAT.

The research studies were reviewed and interpreted within the framework
of a score components model that posited four content-related and two TW
score components, as well as test-taking confidence and efficiency, that
are theoretically subject to STI and ITI effects. In addition, examinee,
item, and instructional characteristics were considered as they relate to
the score components model.

Basic discrepancies between negative and positive findings were noted
for both the SAT-M and the SAT-V. These were generally resolved in favor of
recognizing meaningful STI effects for the SAT-M, but remain unresolved for
the SAT-V. Recommendations were made for SAT-M and SAT-V research allowing
STI effects to be partitioned according to examinee, item, and instructional
characteristics as they apply to selected test score components.
Introduction



This study was requested by the College Board to provide an up-to-date
summary of research findings relevant to the question of special instruc-
tion for the Scholastic Aptitude Test. The need for such a review lies
both in the continued relevance of the question, and in the fact that the
last summary was completed several years ago (College Board, 1968).

Questions regarding special instruction for the SAT remain relevant for
several reasons. One reason is that the continued importance of SAT scores
to examinees results in a continued pressure to obtain "instruction for
the SAT," which in turn leads to an active commercial "coaching" enterprise
and to efforts by some public and private schools to provide such instruc-
tion. Another is that the changing make-up of the examinee population needs
examination. It is entirely plausible, for example, that the more advantaged
students represented in most of the studies cited in the College Board
booklet, Effects of Coaching on Scholastic Aptitude Test Scores (most of
them conducted in the 1950s), were already well prepared to do their best
on the SAT to a degree that cannot be assumed for an increasing proportion
of the current candidate population, particularly minority and other stu-
dents outside the mainstream of educational opportunity. Finally, there
have been studies of instruction directed either to the SAT or to closely
related topics, such as the "coachability" of verbal analogies, that have
appeared since the College Board booklet was published and that need to be
considered in current thinking and general statements regarding instruction
for the SAT.

SCOPE OF THE REPORT

The literature review will cover two interrelated areas of study: (1) stud-
ies of short-term instruction (STI) and intermediate-term instruction (ITI)
directed specifically toward increasing test scores, with particular empha-
sis on the SAT-V and SAT-M; and (2) studies of testwiseness (TW)
that were not specifically directed to raising test scores. The review of
the literature will be followed by recommendations for future research.
Two topics will be considered next that should help clarify the sub-
sequent review of the literature and facilitate the discussion of its im-
plications: the components of observed test scores as they relate to
questions of short-term instruction (STI) and testwiseness (TW); and
definition of terms.


COMPONENTS OF OBSERVED TEST SCORES

Implicit in many discussions of STI and TW is the assumption that an in-
dividual's test score is essentially a composite of the ability or
knowledge for which a person is being tested, testwiseness, "error" (chance
factors in the sampling of test items, lucky guesses), and so on. This
assumption is often accompanied by the belief that the intended "real" or
"true" score on aptitude tests such as the SAT-V and SAT-M is necessarily
(by definition) subject only to gradual, long-term change and, as a
corollary, a distrust or suspicion of anything that might alter aptitude
test scores in a relatively short term (i.e., STI). To put this question
in perspective, it is useful to consider the following delineation of the
components of observed test scores.

A. "True score" components: e.g., verbal aptitude, mathematical aptitude.

   1. A composite of underlying knowledge (e.g., vocabulary, elementary
      algebra) and reasoning ability, developed over a long period of time.
      (Long-term acquisition, long-term retention.)

   2. A state of being well-reviewed, so that the performance to be demon-
      strated is in line with the individual's underlying developed
      competence. (Short-term acquisition, short- or medium-term retention.)

   3. Integrative learning, overlearning, consolidation. (Short-term
      acquisition, long-term retention.)

   4. Learning criterion-relevant, analytic skills (e.g., how to identify
      the main idea of a paragraph; how to simplify complex quantitative
      terms before comparing their value). (Short-term acquisition, long-
      term retention.)

B. Primary test-specific components.

   1. The match between developed ability (including the various score
      components listed in A above) and test content. Mismatches may occur
      as gaps in such areas as skill in locating information in reading
      passages and ability to work with the algebra of inequalities.

   2. General TW: test familiarity, pacing, understanding of general direc-
      tions, general strategies for using partial information, and so on.

   3. Specific TW: components similar to B2, but in reference to charac-
      teristics of specific item formats (such as verbal analogies and
      quantitative-comparison items), and other item characteristics.

C. Secondary components influencing test taking.

   1. Level of confidence.

   2. Level of efficiency: the ability to use available knowledge and
      reasoning ability quickly with a relatively low rate of error re-
      sulting from working rapidly.

D. "Error." Fluctuations in attention, sampling error, variations in luck
   when guessing, etc.

SOME DEFINITIONS AND CONCEPTUALIZATIONS

Terms such as STI, ITI, coaching, TW, guessing, and the "aptitude" versus
"achievement" distinction are central to discussions regarding special
preparation for test taking, and the meanings of these terms tend to vary
from one writer to the next. It will be useful, therefore, to give a brief
definition of each, as used in this review, and to expand on the conceptu-
alizations where needed.

Short-term instruction (STI). The term STI will refer to attempts to
improve test scores by means of a relatively short period of instruction;
relatively short, that is, when compared to the amount of time generally
considered necessary for any substantial change in the ability or knowledge
in question. STI may be directed toward any or all of the components of
observed test scores noted above except true-score component A1, which is
by definition limited to long-term acquisition. Note that STI for compo-
nents A2, A3 and A4 is in fact directed toward the ability of interest,
even though the instruction is short-term. It may be added that in general
there is no sharp contrast between education and STI. Given appropriate
content, STI may properly be viewed as instruction provided in addition
to, rather than instead of, conventional long-term learning (i.e., component
A1).

Intermediate-term instruction (ITI). As the name suggests, ITI will refer
to attempts to improve test scores by means of special instruction for a
somewhat longer period than STI but still a short period compared to the
amount of time generally considered necessary for substantial changes in
the ability in question. Except for the difference in the relative period
of instruction, the description given of STI also applies to ITI.

Coaching. This term will refer to a subset of possible STI activities
limited essentially to very brief instruction in general testwiseness, such
as effective pacing, answering items whenever partial information about
them is known, and practice in answering questions similar to those in the
target examination. Specifically not included in this definition of coach-
ing is any content instruction beyond that which is merely incidental to
the practice sessions. This definition is implicit in the College Board
(1968) statement on coaching, in the design of most of the studies re-
ported there, and in the interpretation of their results. It has been
fairly widely adopted, as is indicated in a recent statement on coaching
made by Anastasi (1976): "Item types on which performance can be appre-
ciably raised by short-term drill or instruction of a narrowly limited
nature are not included in the operational forms of the (SAT) tests"
(p. 43).

Testwiseness (TW). In essence, TW is a set of skills and knowledge about
test taking that enables individuals to display their abilities (e.g.,
verbal and mathematical aptitude) to their best advantage. A TW component
is by no means unique to standardized tests. It is also present in other
modes of assessment such as classroom recitation and essay writing.

Early recognition of the TW component in SAT scores is evident from the
fact that "From 1926 to 1944 candidates were required to present completed
practice booklets before they were allowed to take the test" (Fremer and
Chandler, 1971, p. 147). TW instruction is sometimes viewed primarily as
an effort to beat the test, with the assumption that testwise examinees
will somehow get higher scores than they deserve. For well-made standard-
ized tests, however, clues that offer spurious routes to correct answers
are scrupulously avoided, and the opposite, more compelling concern is that
examinees who are not testwise may receive inappropriately low scores. Thus,
Stanley (1971, p. 364) uses the contrasting term "test-naivete," and Ebel
(1965) notes that "More error in measurement is likely to originate from
the students who have had too little, rather than too much, skill in taking
tests" (p. 206).

Guessing. Stated simply, guessing consists of answering a test question
in the absence of certainty as to the correct response. It may be divided
into three categories: guessing that is blind or random, guessing that is
spurious or based on a hunch, and guessing based on partial information. In
contradistinction to the common feeling that guessing is at least faintly
disreputable, the following four points should be noted.

First, guessing is necessary for responding appropriately to the SAT and
to most kinds of assessment. Most examinees encounter some test questions
about which they have partial information that would enable them to elim-
inate at least one choice. In such cases, they must guess among the remain-
ing alternatives if they are to benefit from their partial information.
Their guessing in such instances benefits not only them but the users of
the test scores as well, because only when partial information is used is
it possible to give greater credit to those who are partially informed with
respect to a given question than to examinees who are uninformed about it.

Second, although not everyone would agree, guessing would appear to be
appropriate in situations such as taking the SAT. This point may be clari-
fied by describing contrasting situations. If a student is taking an "open-
book" examination, or is writing a term paper, it would indeed be un-
scholarly and inappropriate to guess or to gloss over points of uncertainty
rather than seeking out the needed information. On the other hand, guessing
may be inappropriate in a testing situation in which the required informa-
tion has been clearly specified ahead of time, and mastery of that informa-
tion emphasized. This would be particularly true if the testing procedures
used are consistent with this situation, and guessing on the test is actively
discouraged. However, aptitude tests such as the SAT, and even typical large-
scale standardized achievement tests, present a test-taking situation that
is markedly different. There is not a clearly specified listing of points of
information to be mastered, and of course there is no opportunity for seeking
additional information as is the case for "open-book" tests. Thus, the test
situation, including accompanying directions about guessing, makes it appro-
priate to guess when answering SAT items.

Third, it may be argued that despite the misgivings of some educators,
guessing on tests such as the SAT is not antithetical to good decision
making or good scholarship. In most enterprises, whether building bridges
or investigating theoretical problems, the point is necessarily reached
where information gathering must be terminated and estimations, educated
guesses, and the like must be resorted to.

Finally, the net result of guessing on the SAT is fair; over a set of
items, partial credit is received for using partial information.
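
The "partial credit" point can be illustrated with a small calculation. The
sketch below is only an illustration under stated assumptions: the text does
not give the scoring rule, so the code assumes five-choice items scored with
the conventional correction for guessing (one point for a right answer, minus
1/(k - 1) point for a wrong answer on a k-choice item), and it assumes that
every choice an examinee eliminates is in fact wrong. The function name is
hypothetical.

```python
from fractions import Fraction


def expected_item_score(n_choices: int, n_eliminated: int) -> Fraction:
    """Expected score on one item when guessing evenly among remaining choices.

    Assumed scoring rule (not taken from the text): +1 for a right answer,
    -1/(n_choices - 1) for a wrong answer. Eliminated choices are assumed
    to be genuinely wrong, so the right answer is always among those left.
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    remaining = n_choices - n_eliminated
    if n_eliminated < 0 or remaining < 1:
        raise ValueError("n_eliminated must be between 0 and n_choices - 1")
    p_right = Fraction(1, remaining)        # chance the guess is correct
    penalty = Fraction(1, n_choices - 1)    # deduction for a wrong answer
    return p_right - (1 - p_right) * penalty


if __name__ == "__main__":
    # Expected score per item as more wrong choices are ruled out.
    for eliminated in range(4):
        print(eliminated, expected_item_score(5, eliminated))
```

Under these assumptions a blind guess is worth nothing on average, while
ruling out one, two, or three choices before guessing is worth 1/16, 1/6,
and 3/8 of a point per item respectively, so the expected credit rises with
the amount of partial information used.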

Aptitude versus achievement testing. The literature on STI and TW is
sprinkled with allusions to differences between aptitude and achievement
tests, generally indicating that STI effects are both more likely and more
acceptable for achievement tests than for aptitude tests. Essentially, the
distinction is that aptitude tests are more general, more oriented toward
reasoning, and less curriculum-bound than are their achievement test
counterparts. The distinction becomes problematic when it is then suggested
that aptitude tests, by definition, should be relatively impervious to STI.
With regard to the components of observed test scores noted above, this
need only be true for component A1. Component A2 (effective review) theo-
retically allows for STI effects on aptitude test scores, because as Carroll
(1970) has observed, "The SAT is in truth a test of developed abilities,
depending both on general intellective capacities to learn and on an
accumulation of knowledge and skills, acquired through education in, and
experience with, the verbal and mathematical aspects of this nation's
culture" (p. 2). STI components B2 and B3 (general and specific TW) apply
more potentially to aptitude tests than to achievement tests to the ex-
tent that aptitude tests more often resort to more complex item formats
such as verbal analogies, data sufficiency items, and quantitative com-
parisons. There appears to be an increasing tendency toward seeing the
distinction between aptitude and achievement testing as one that is rela-
tive rather than categorical, particularly with regard to the mathemati-
cal area.






Literature Review



Because research regarding the SAT-M is more definitive than that di-
rected to the SAT-V, the two will be reviewed in that order. Selected
studies of instruction directed to other aptitude tests and subtests and
to achievement tests will be considered. Finally, studies examining
selected aspects of TW will be reviewed.

INSTRUCTION FOR THE SAT-MATHEMATICAL

Studies of STI or ITI directed specifically to increasing scores on the
SAT-M will be considered in chronological order. Those conducted prior to
the Pike and Evans (1972) report will be considered only briefly, because
they have been summarized elsewhere (College Board, 1968; Evans and Pike,
1973; Fremer and Chandler, 1971; Pike and Evans, 1972).

The first six of these studies (Dyer, 1953 a,b; French, 1955 a,b; Lass,
1958; French and Dear, 1959; Frankel, 1960 a,b; Whitla, 1962) all involved
the use of SAT pretests and posttests. The period of time devoted to STI
followed a typical format chosen by the instructors, but generally con-
sisted of group practice with test items similar to those appearing in the
SAT-M. All reached the conclusion that score gains attributable to coaching
were not sufficient to justify having students invest time in such instruc-
tion to improve their scores. In some of the studies of particular sub-
groups of students and/or particular kinds of items there were instances of
meaningful score gains. These instances (as well as any other exceptional
finding or observation) will be noted for each of the studies.

Of the last four of the studies directed to increasing the SAT-M scores
(Marron, 1965; Roberts and Oppenheim, 1966; Pike and Evans, 1972; McCarthy,
1976), all but the second differ from the first six studies, particularly
in that they give emphasis to mathematics content review in addition to
other kinds of STI or ITI. The Roberts and Oppenheim study differs from all
the others in focusing on students considered to be academically disad-
vantaged.

Dyer study. In this study, coached students (239 boys) averaged 13
points greater gain on the SAT-M 200 to 800 scale than was observed among
the 229 control students in a similar preparatory school. The effect was
considerably greater, 29 points, when the comparison was made for students
who had taken no mathematics as seniors. The 13-point and the 29-point
differences were both statistically significant. Data in the appendix to
the Dyer report indicate an average gain of about 15 SAT-M points for the
total group of control students.

French study. Here, an overall gain of 18 SAT-M points was observed
comparing coached students' gains in one school to those for control stu-
dents in two other schools. Boys not currently taking mathematics gained 29
points when compared to one control group, and 9 when compared to another;
those taking mathematics gained 19 points and 5 points for the same com-
parisons. Paradoxically, girls not taking mathematics gained either 5 points
or 1 point, whereas those who were taking mathematics showed a coaching
effect of 30 points or 20 points. A plausible explanation would be that in
both studies the boys currently taking mathematics had a "ceiling effect"
on the benefits of review, but that those not taking mathematics were able
to get maximum benefit from coaching. For the girls, on the other hand, it
may be that those not taking mathematics were victims of what Tobias (1977)
describes as "math anxiety," since they appeared to derive no benefit from
the brief review that was provided.

Lass study. Comparisons were made of gains between junior- and senior-
year SAT-M scores for students who received no coaching, those who re-
ceived outside coaching, and those who received a school-provided orienta-
tion program. The latter made students familiar with SAT testing proce-
dures and test content but did not involve extensive drill on multiple-
choice test questions or other typical coaching activities. SAT-M score
gains for the three groups were 53, 64, and 52 points, respectively, from
junior- to senior-year test administrations. Thus, there was a slight ad-
vantage for receiving coaching. Perhaps more notable are the sizable
changes for all three groups compared to the 15- to 20-point gains ordi-
narily observed over this interval of schooling.

Dear study. This study, reported by French and Dear (1959), was de-
signed to be more intensive than the earlier studies. Classes were much
smaller (two students in each), and more time was allotted. However,
specifics of content and form of instruction were again left to individual
teachers, and the assumption that classes of only two students are optimal
is not necessarily true. For students not currently taking mathematics,
those receiving coaching gained an average of 28 points more than those not
coached. For students who were taking mathematics, the average gain attrib-
uted to coaching was only 6 points.

Frankel study. This study involved students at the Bronx High School of
Science, which had a record of sending 98 percent of its graduates to
college. Nearly all students take four years of mathematics. In this study,
coached students received 30 hours of instruction from a commercial coach-
ing school. Those who were coached were reported as experiencing a 9-point
loss when compared to the controls. However, Frankel also reported the gain
scores for both groups, rather than simply the difference between the two.
Control subjects gained 66 points between the May and December or January
SAT-M, compared to 57 points for coached subjects. These changes, when
compared to an average change in SAT-M scores over a similar interval of
15 points for over 1.6 million students (Pike and Evans, 1972, p. 5), sug-
gest that the faculty at Bronx High School of Science were already doing
exceptionally well in preparing students for taking the test, whether
directly or indirectly. In such a school, there is evidently little need
for any additional preparation for test taking.
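
The two comparisons made in the Frankel paragraph (coached versus control,
and each group versus the typical change over a similar interval) can be
laid out as simple differences. The sketch below uses only the figures
quoted above; the function names are hypothetical and are introduced purely
for illustration.

```python
def net_coaching_effect(coached_gain: float, control_gain: float) -> float:
    """Gain of coached students over and above the gain of controls."""
    return coached_gain - control_gain


def gain_beyond_typical(gain: float, typical_gain: float) -> float:
    """How far a group's gain exceeds the typical gain for the interval."""
    return gain - typical_gain


# Figures quoted in the text for the Frankel SAT-M comparison.
COACHED_GAIN = 57
CONTROL_GAIN = 66
TYPICAL_GAIN = 15  # average change over a similar interval (Pike and Evans, 1972)

if __name__ == "__main__":
    print("net coaching effect:", net_coaching_effect(COACHED_GAIN, CONTROL_GAIN))
    print("coached beyond typical:", gain_beyond_typical(COACHED_GAIN, TYPICAL_GAIN))
    print("control beyond typical:", gain_beyond_typical(CONTROL_GAIN, TYPICAL_GAIN))
```

Read this way, the 9-point "loss" is the net effect (57 minus 66), while both
groups exceed the typical 15-point change by more than 40 points, which is
the pattern the paragraph attributes to the school rather than to coaching.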

Whitla study. Like Frankel, Whitla examined the effects of commercially
provided coaching for the SAT over a similar time interval between pretest
and posttest. Coached students showed no SAT-M gain between the second
pretest and the posttest; control subjects gained 6 points. Control sub-
jects were volunteers in the same schools attended by students who had
elected to obtain instruction from a proprietary organization.

Marron study. The effects of intensive ITI directed to the SAT were

than the coaching. Once again, all three groups showed sizable average
gains.

Dear study. This study (French and Dear, 1959) showed essentially no
effects due to SAT-V instruction for coached students.

Frankel study. Among students at Bronx High School of Science, un-
coached students gained 38 SAT-V points compared to 47 points for those
who received commercial instruction, a difference of only 9 points. Gains
for both groups were larger than typically observed for May-December/
January score changes (18 points on the average for SAT-V), which again
suggests an accelerated rate of growth at this school. The SAT-V gains
were not as pronounced as for the SAT-M, on which controls and coached
students gained 66 and 57 points respectively.

Pallone study. Pallone (1961) reports the effects of two programs of
instruction for the SAT-V, one STI and the other ITI. He did not attempt
instruction for the SAT-M. Pallone deliberately designed instruction to go
beyond the "coaching" that has so regularly been found ineffectual and
focused instead on the reading, vocabulary, and logical reasoning abil-
ities that the SAT-V is assumed to measure. The STI was in the form of a
very systematic study program involving instruction in intensive reading
skills, skimming, critical reading, reading comprehension exercises, and
the analysis of verbal analogies and was provided in daily 90-minute
sessions over a six-week period. Thus, it seemed most directed to score
component A4 (learning criterion-relevant analytic skills) and also
covered component A3 (integrative learning), B1 (filling in gaps in
developed ability), and B3 (TW specific to analogies).

The 20 participating students showed an average gain of 98 SAT-V points.
Because there were no control subjects there is no direct way to subtract
from this the effects of practice and growth in order to estimate the STI
effect. Using the gains experienced by controls at the Bronx High School
of Science as a rough (and probably conservatively high) estimation of
control subject gains, the effects of STI in the Pallone study could be
estimated as approximately 60 points.
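
The estimate above is a subtraction of a borrowed reference gain from the
observed gain. The sketch below reproduces it and, as an illustrative
addition that is not part of the original analysis, repeats the subtraction
with the typical May to December/January SAT-V change of 18 points quoted in
the Frankel paragraph, to show how much the answer depends on the reference
chosen. The intervals behind the two reference figures need not match
Pallone's six-week program, so both results are rough. The function name is
hypothetical.

```python
def estimated_sti_effect(observed_gain: float, reference_gain: float) -> float:
    """Observed gain minus a reference gain standing in for missing controls."""
    return observed_gain - reference_gain


PALLONE_STI_GAIN = 98  # average SAT-V gain of the 20 STI students

# Reference gains quoted elsewhere in the text (SAT-V).
REFERENCE_GAINS = {
    "Bronx High School of Science uncoached students": 38,
    "typical May to December/January change": 18,
}

if __name__ == "__main__":
    for label, reference in REFERENCE_GAINS.items():
        effect = estimated_sti_effect(PALLONE_STI_GAIN, reference)
        print(f"{label}: about {effect} points")
```

With the Bronx High School of Science figure the estimate is 60 points, as
stated in the text; with the smaller typical change it would be 80, which is
consistent with describing the 38-point reference as conservatively high.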

The ITI program involved daily 50-minute instructional periods over a
five-month interval. Program content was similar to the STI except for a
substantially greater amount of instruction. About 80 students completed
the ITI program, and for these the average SAT-V score gain was 109 points.
The 20 students receiving STI also received the ITI, and there was an over-
all score gain for these students of 122 points.

Whitla study. Students receiving commercial instruction for the SAT-V
gained 11 points more than the control subjects between pretest and post-
test. (Controls gained 20 points between the two testings, and 39 points
altogether between the pre-pretest junior-year SAT and the posttest taken
as seniors.)

Marron study. Following intensive ITI in the 10 preparatory schools,
the average SAT-V score changed from 471 to 528, a gain of 57 points.

Roberts and Oppenheim study. Volunteers in six Tennessee high schools
were randomly assigned to a PSAT-V instructional group or to a control
group. Mean PSAT-V pretest scores were equivalent to about 315 on the
SAT-V scale. As with PSAT-M instruction, programmed instruction was pro-
vided in 15 half-hour sessions. Instructed students gained the equivalent
of 7 SAT-V points, and controls lost 7 points. The control group's loss of
points was apparently due to motivational problems.

Coffman and Neun study (1966). This study was undertaken to determine
the effect of a presumably typical accelerated reading course on SAT-V
scores. Three groups of college freshmen took part in the study, each
receiving 45 to 50 hours of instruction as part of a college-credit course
emphasizing speed with relative accuracy. There were no control subjects.
Mean score changes were +4, +10, and -29. The last change is statistically
significant, suggesting that instruction for that group may actually have
hindered effective performance on the SAT-V. The authors described the
results as being in disagreement with Pallone's findings. However, since
the instruction appears to lack most of the features provided by Pallone
for increasing verbal reasoning powers, the two studies seem scarcely
comparable.

INSTRUCTION FOR TESTS OTHER THAN THE SAT 

Two studies (Marron, 1965; Jacobs, 1966) involving instruction for the
College Board English Composition Test (ECT) are relevant to the question
of instruction for the SAT-V. It may be noted, for example, that two of
the four item formats used in the SAT-V (reading comprehension and
antonyms) could as well be viewed as testing the attainment of reading
skills and vocabulary respectively. Furthermore, the ECT contains com-
plicated item formats, and, as a result, instruction directed in part to
the relevant TW components may have implications for TW instruction for
other relatively complex formats such as verbal analogies in the SAT-V
and data sufficiency or quantitative comparison items in the SAT-M.

Two additional studies (Moore, 1971; Whitely and Dawis, 1974) are
addressed specifically to questions regarding instruction for answering
analogy items.

Marron study. Of the students taking SAT pretests and posttests in the
Marron study, 347 also took the ECT on both occasions. The average gain on
the ECT 200 to 800 scale was 83 points, from a pretest score mean of 458.

Jacobs study. Student volunteers in each of six schools were randomly
assigned to a group receiving instruction or a control group. The STI con-
sisted of six three-hour sessions. About nine hours were spent directly on
criterion skills (score components A2, 3 and 4), and about nine hours on
specific TW (score component B3) related to item format. In each school,
specific TW was directed to two of the three ECT item formats (sentence
correction, construction shift, and paragraph organization). The ECT was
administered only after the experimental subjects had received instruction.
In two of the schools, involving a total of 36 students receiving STI and 44
control students, there were only negligible differences between scores for
the two groups. In the other four schools, involving 91 instructed and 87
control students, mean differences ranged from 44 ECT points in one of the
schools to 73 points in another. Such clear evidence of STI effects occur-
ring in some schools but not in others suggests that the specifics of STI
provided by different instructors may have a marked effect on the outcome
of an STI experiment.

Moore study. Instruction for answering verbal analogy items was provided
to graduate students by a booklet directed to two aspects of the task:
understanding the format of the question, and learning to recognize
specific classes of relationship. The 38 subjects were randomly assigned
to an experimental or a control group. A 75-item analogy test with a
somewhat more cumbersome format than that used for the SAT-V was subse-
quently administered. Students receiving STI averaged 44.3 items correct
compared to 39.7 for controls, a difference of about three-fourths of a
standard deviation. The number of subjects was very small, so these re-
sults should be considered tentative. If the findings replicated, however,
they would demonstrate that even brief instruction to relatively sophisti-
cated examinees can make a difference in performance on verbal analogy
items.
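
The text reports the Moore result both in raw items and in standard
deviation units but does not give the standard deviation itself. As a rough
check, the sketch below backs out the standard deviation implied by the two
figures. It treats "about three-fourths" as exactly 0.75, which is an
assumption, so the result is only approximate; the function name is
hypothetical.

```python
def implied_sd(mean_treated: float, mean_control: float,
               standardized_difference: float) -> float:
    """Standard deviation implied by a raw and a standardized mean difference."""
    if standardized_difference == 0:
        raise ValueError("standardized difference must be nonzero")
    return (mean_treated - mean_control) / standardized_difference


if __name__ == "__main__":
    raw_difference = 44.3 - 39.7           # items correct, STI minus control
    sd = implied_sd(44.3, 39.7, 0.75)      # 0.75 is an assumed reading of "about three-fourths"
    print(f"raw difference: {raw_difference:.1f} items")
    print(f"implied standard deviation: about {sd:.1f} items")
```

On that reading the 4.6-item difference corresponds to a standard deviation
of roughly 6 items on the 75-item test.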

Whitely and Dawis study. The subjects were 184 students randomly
selected from the class lists of two inner-city high schools in St. Paul,
Minnesota. Those selected were randomly assigned to one of five treatment
groups or to a control group. Verbal analogy items used for the study had
an unusually low vocabulary level, so that answering the items would de-
pend primarily on the ability to educe relationships rather than on word
knowledge. The pretest and posttest each consisted of a 41-item analogy
test. Fifty analogy items were used for all five treatments. One treat-
ment involved practice on the 50 items without feedback, and another in-
volved practice with feedback of the correct answer. The other three treat-
ment groups also had practice with the 50 items, with instruction inter-
spersed between item subsets that was addressed primarily to helping
students learn to recognize such categories of relationships as "opposites,"
"class membership," and "functional." The three groups receiving instruc-
tion differed in that one was instructed under the condition of feedback
and structural aids (in which 10 additional analogies were presented with
structural labels and arrows indicating the related pair), another with
feedback only, and the third with structural aids only. It was found that
the only experimental group to perform significantly better than the con-
trols was the one receiving instruction combined with both feedback and
the diagrammatic structural aid. All six groups had pretest means of about
24 and standard deviations of about 9. The control group gained about 2.3
items correct, the "instruction plus feedback plus structure" group gained
about 6.3, and the other groups between 3.4 and 4.0 items correct.
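
To put the Whitely and Dawis gains on a common footing with the other
studies, each group's gain over the control gain can be expressed in
pretest standard deviation units. The sketch below does this with the
approximate figures quoted above (pretest standard deviation of about 9,
control gain of about 2.3 items). Standardizing on the pretest standard
deviation is a choice made for this illustration, not something the report
itself does, and the function name is hypothetical.

```python
PRETEST_SD = 9.0     # approximate pretest standard deviation, all groups
CONTROL_GAIN = 2.3   # approximate gain of the control group, in items


def standardized_net_gain(group_gain: float,
                          control_gain: float = CONTROL_GAIN,
                          sd: float = PRETEST_SD) -> float:
    """Gain beyond the control gain, expressed in standard deviation units."""
    return (group_gain - control_gain) / sd


if __name__ == "__main__":
    # Approximate gains quoted in the text, in items correct.
    gains = {
        "instruction plus feedback plus structure": 6.3,
        "other treatment groups, low end": 3.4,
        "other treatment groups, high end": 4.0,
    }
    for label, gain in gains.items():
        print(f"{label}: {standardized_net_gain(gain):.2f} SD")
```

On these figures the full-treatment group gained about 0.44 standard
deviation more than the controls, against roughly 0.12 to 0.19 for the
other treatment groups.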

These results indicated that well-designed STI (only 50 minutes were
used for the intervention) can sometimes meaningfully increase performance
on analogy items, and that practice beyond that obtained in taking the
pretest, even with feedback, had no meaningful effect unless it was
supplemented by carefully designed instructional materials.
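
As a rough illustration of the size of these gains, the reported means can be expressed as gain beyond the control group in pretest standard deviation units. The sketch below is not part of the Whitely and Dawis analysis. The function name is hypothetical, and the inputs are the approximate values cited above (a pretest standard deviation of about 9 and gains of about 2.3, 6.3, and 3.4 to 4.0 items).

```python
def gain_over_control_in_sd(treatment_gain, control_gain, pretest_sd):
    """Gain beyond the control group's gain, in pretest SD units.

    Hypothetical helper for illustration; not taken from the original study.
    """
    if pretest_sd <= 0:
        raise ValueError("pretest_sd must be positive")
    return (treatment_gain - control_gain) / pretest_sd


# Approximate values as reported in the text above.
PRETEST_SD = 9.0
CONTROL_GAIN = 2.3
reported_gains = {
    "instruction + feedback + structure": 6.3,
    "other groups (low end)": 3.4,
    "other groups (high end)": 4.0,
}

for label, gain in reported_gains.items():
    effect = gain_over_control_in_sd(gain, CONTROL_GAIN, PRETEST_SD)
    print(f"{label}: {effect:.2f} SD beyond controls")
```

Under these assumptions the fully supported treatment comes out at roughly 0.4 of a standard deviation beyond the controls, against roughly 0.1 to 0.2 for the other treatments, which is consistent with only the first difference having been found significant.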

STUDIES EXAMINING TW

The topic of TW is frequently investigated in studies not involving
instruction for specific tests or subtests. Because of the importance of TW
as a component of test scores, and the implications of this component
regarding test validity and fairness, some of the general findings in these
studies of TW will be reviewed here. These will be clustered in several
categories. First will be studies or commentary relevant to adequately
defining TW (Alker, Carlson, and Hermann, 1967; Crehan, Koehler, and
Slakter, 1974; Diamond and Evans, 1972; Ebel, 1965; Millman, Bishop, and
Ebel, 1965; Stanley, 1971). Second will be the topic of guessing (Cronbach,
1970; Diamond and Evans, 1973; Flaugher and Pike, 1970; Lord, 1964; Lord,
1975; Slakter, 1968a,b; Pike and Evans, 1972; Pike and Flaugher, 1970;
Thorndike, 1971). Third is the related topic of risk taking (Slakter, 1967;
Slakter, 1969; Slakter, Crehan, and Koehler, 1975; Swineford and Miller,
1953). Fourth is another topic related to guessing, that of answer changing
(Bath, 1967; Jacobs, 1972; Lynch and Smith, 1975; Mueller and Schwedel,
1975; Mueller and Wasser, 1977). The fifth topic is TW related to
particular kinds of items. These include studies of verbal analogies
(Connolly and Wantman, 1964; Gentile, 1966; Gentile, 1968; Gentile, Kessler,
and Gentile, 1969; Willner, 1964), and of reading comprehension items
(Pyrczak, 1974; Vernon, 1962).

On defining TW. Earlier in this paper, TW was defined as "that set of
skills and knowledge about how to take a particular test that allows
individuals to display their abilities to their best advantage." It will be
useful at this point to consider other definitions, explicit or implicit,
commonly used in the testing literature when discussing TW. The definition
most often encountered in the literature is that proposed by Millman,
Bishop, and Ebel (1965): "'Test-wiseness' is . . . a subject's capacity to
utilize the characteristics and formats of the test and/or the test-taking
situation to receive a high score. Test-wiseness is logically independent
of the examinee's knowledge of the subject matter for which the items are
supposedly measures" (p. 707).

Implicit in both definitions is the possibility that some aspects of TW
are necessary if examinees are to receive proper credit for the knowledge
or ability being tested, and that other aspects of TW may allow examinees
to receive more credit than is their due, i.e., the test-sophisticate may
be able to "beat the test." The Millman et al. definition appears to elicit
the latter concern. Typical of reformulations of their definition is that
used by Diamond and Evans (1972), who define TW as ". . . the ability to
respond advantageously to multiple-choice items containing extraneous clues
and to obtain credit on these items without knowledge of the subject matter"
(p. 145). Another instance of picking up on the beating-the-test aspect of
the Millman et al. definition is found in Alker et al. (1967), who state:
"Defined in this way (Millman et al.), testwiseness emphasizes the use of
the format of the test rather than its content to achieve higher
score. . . ." (p. 11). Note that they could as well have said "in addition
to" instead of "rather than." On the other hand, awareness of a need
for the opposite concern, particularly with respect to well-constructed
objective tests, is evident in statements by writers such as Ebel and
Stanley, as was noted earlier. With regard to tests such as the SAT, a
concern that examinees should have the required TW to cope well with the
test as a vehicle through which they are to demonstrate their verbal or
mathematical ability would appear to be more compelling. This is in keeping
with the recommendations of Crehan et al. (1974) who, upon demonstrating
that some examinees are consistently low on TW across tests, noted
that TW can never be fully eliminated as a component of standardized tests
and suggested that ". . . perhaps more thought should be given to the
teaching of TW to students low in TW" (p. 211).

Studies of guessing. A central part of TW is knowing when and how to
guess, where guessing is defined as answering a test question in the
absence of certainty as to the correct response. The problem is especially
troublesome for objective tests of ability, in part because multiple-choice
questions heighten our awareness of the guessing component, and in
part because a general test of ability, especially if it is of an
appropriate difficulty level for a given examinee, will have many items for
which the examinee is neither certain of the correct answer (and therefore
has no need to guess) nor so totally uninformed as to be reduced to blind
guessing. Although blind or random guessing is the kind that comes to mind
initially and is discussed most often, it is probably the least likely to
occur. Most guessing decisions will involve choosing whether to answer a
question or not, when the basis for doing so is either partial
information or a spurious hunch or feeling.

Much of the research literature on the question of guessing on objective
tests is focused on the use or nonuse of a "correction formula" for
guessing. Diamond and Evans summarized this literature in 1973 and found
little basis for any conclusive answers. When not certain of the answer to
a question, examinees vary considerably in their willingness to guess, even when
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favoring guessing (it appeared easier to inhibit guessing than to encourage
it); and (3) there was little relationship between guessing, risk taking,
and ability.

Various measures of risk taking (RT) have been compared by Slakter (1967).
In a later study (1969) he pointed out that examinees who do not take risks
tend to be penalized on test scores, and he also noted that this tendency
usually generalizes across different tests. Still more recently Slakter,
Crehan, and Koehler (1975) reported on a longitudinal study of RT tendency.
They found again that RT was relatively stable across tests for an
individual examinee at a given time, but found longitudinal changes that point
to the fact that RT tends to decrease over grades 5 to 9 and then becomes
relatively stable, at least through grade 11. They noted an important
implication of this finding, that the contribution of RT strategy toward
maximizing test scores actually tends to become less between grades 5 and
9.

Studies of answer changing. Yet another aspect of the question of
guessing, which is more in the realm of how to guess than when, is that of
answer-changing behavior on multiple-choice tests. The topic is of interest
because student opinion and much of the advice given by educators runs
directly counter to most research findings. Two excellent summaries on the
question are provided by Lynch and Smith (1975) and Mueller and Wasser
(1977). Among the more recent studies of interest are those of Bath (1967),
Jacobs (1972), and Mueller and Schwedel (1975). The following conclusions
emerge from these studies.

1. Most examinees express the belief that it does not pay to change
answers.

2. Most examinees do change answers but typically on only about 4 percent
of the questions.

3. In fact it generally does pay to change answers. Typical findings
are that there are about two favorable changes for every unfavorable change.

4. Gains drop off as items get relatively more difficult.

5. Higher scoring examinees tend to benefit more from changing answers
than do those who score lower.

Studies of testwiseness (TW) for specific item types. The "how" of
effective guessing becomes particularly central when attention is given to
specific kinds of items, especially those that are relatively complex. In
surveying studies of TW, studies directed specifically to reading
comprehension items and to verbal analogies, both of which are relatively
complex, were found particularly relevant.

Vernon (1962) examined the assessment of reading comprehension of
British and American examinees by comparing free-response data from essays,
fill-in sentences, etc. to multiple-choice responses. He found a
test-sophistication factor in the multiple-choice responses of British
examinees who were generally unfamiliar with such tests that was much less
evident in American responses. The difference was more pronounced for the
reading comprehension items than for the more straightforward vocabulary
questions. Pyrczak investigated an intriguing aspect of testwiseness by
studying the effects of answering test items for reading comprehension
independently of the accompanying passage. In one study (1972) he found
that examinees rely on various sources of information and misinformation
when answering such questions in the absence of the reading passages and
also make use of interrelationships among the items in a given set. In a
subsequent study (1974) he reduced these sources of answering strategy
and found that examinees were still able to perform at a better-than-chance
level, presumably by such devices as selecting statements of
general principles rather than specific facts, and by selecting the most
general of several principles presented.

Willner (1964), working with analogy items drawn from a wide variety
of tests (Miller Analogies Test, Army Alpha, Otis Beta, etc.), found that
about half the items could be answered correctly on the basis of word
association alone, i.e., without having to educe relationships for a given
item and then solving the analogy on the basis of the educed relationships.
He recommended that analogy items that are substantially free of the
word-association effects on analogy solving be constructed and used in tests.
He added the impressionistic observation that in some instances word
associations led to the wrong answer, and that some examinees who might have
solved an analogy on the relational basis appeared instead to have been
distracted from doing so by the strong associational attraction of one of
the error choices.

Connolly and Wantman (1964) used "think aloud" data elicited from nine
subjects in solving verbal analogies to observe analogy-solving processes.
The observations were largely impressionistic. Two impressions were relevant
to the present review. First, the words provided in the alternative answer
choices influenced how the stem words were interpreted. The subjects were
often observed to revise the relationships they had established for the stem
pair of words to fit the demands of the first option. Second, it was observed
that the subjects seemed to differ considerably in their methods of
attacking or analyzing test items. Both observations are relevant to score
component B-3, "specific TW," and the second has implications regarding
component A-4, "relevant analytic skills."

Gentile (1966) and Gentile, Kessler, and Gentile (1969) have also
examined performance in solving verbal analogies (drawn from retired SAT-V
items), giving primary consideration to the amount of score variance
attributable to word associations. In the 1969 study "associative
relatedness" was found to account for 28 to 50 percent of the score variance.
Their discussion suggests that they consider the effect of associative
relatedness to be an inherent part of analogy items, a position that
contrasts with Willner's discussion in which the availability of an
associational basis for answering analogies without resort to educing
relationships is viewed as a problem that can be remedied by changes in test
construction. Gentile (1968) also examined the effect of sociocultural
level and the knowledge of definitions on analogy solving. The latter was
done by observing the effect of providing definitions of words appearing
in the analogies. He found the effects both singly and in combination to
be weak.



Summary and Interpretation of Findings



Some general considerations will be noted first that provide a useful
framework for doing so. Following that, findings relevant to the SAT-M and
those having a bearing on the SAT-V will be considered, with results derived
from studies of tests other than the SAT cited where appropriate. Findings
from studies of TW will be summarized last.



GENERAL CONSIDERATIONS

Design characteristics. The first consideration is that of the basic design
features of the studies themselves. The studies differ substantially in such
important variables as the use or nonuse of control groups, the selection
of control groups (ranging from using groups of students in schools
generally comparable to those the experimental subjects are in, to the use of
random assignment), the number of subjects, and the use either of pretest
and posttest data or of alternatives to that procedure.

Summarizing mixed findings. The next consideration is the question of
how research findings should best be interpreted, particularly when making
comparisons across studies. In principle, a single study showing
substantial positive gains cannot be countered or refuted by any number of
studies failing to get positive results. The only near exception would
occur in the event of a well-designed replication study that failed to show
similarly positive results. In that case, there would be a discrepancy
needing further study and resolution. Similarly, it would be fallacious to
infer, from mixed results across studies on a topic such as STI effects,
that across-study inconsistencies justify the conclusion that there are
no meaningful effects. As exemplified in Jacobs' (1966) discussion of
differences on English Composition Test score changes from one
experimental group to another, mixed results can mean that an effort should
be made to find out why instruction was effective in some places but not in
others. This observation is particularly true when making comparisons
between studies in which little account was taken of either examinee or
instructional characteristics. A third observation is that there has been
a considerable emphasis in most discussions of STI on the overall
magnitude of its effects, with little consideration given to differences
among examinees, STI curriculums, or item formats and other item
characteristics, especially when stating final conclusions.

The tendency to consider only overall average results of STI, together
with a polarization of attitudes toward STI as being essentially good or
bad, has tended to distract attention from analyses and interpretations
that could lead to a more cumulative, orderly base of information regard-
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her underlying developed competence. We see, then, that it is the nature
rather than the amount of STI gains that determines whether they may be
properly considered as excessive.

Score component A-3 (integrative learning, overlearning) again
presupposes prior learning at some reasonable level of mastery. Component
A-4 (knowledge of criterion-relevant analytic skills) is at least
conceptually subject to STI effects, and if this component were shown to be
subject to STI it would in no way invalidate the test. Evidence that such
abilities are subject to STI should give us more comfort than discomfort,
although it would heighten our awareness of possible disparities in the
quality of education, whether short term or long term in acquisition.
There is little hard data on the topic. It was addressed most directly,
perhaps, in the work of Bloom and Broder (1950) and less directly with
regard to the SAT-M by Pike and Evans (1972) and for the SAT-V by Pallone
(1961). The paucity of data suggesting STI effects for component A-4 may
suggest that meaningful gains in this realm, even given excellent
instruction for teaching these analytic skills, are likely to be observed
only for students who had developed a "readiness" for such gains. A student
who reads widely and with avid interest but has not honed his or her
analytic reading skills may be such a person.

The next three score components to consider with regard to limits of
STI effects are those specific to the activity of test taking itself.
Component B-1 (the match between the domain of the examinee's developed
ability and test content) is likely to yield only slight STI effects if
examinees are clearly aware of the test content domain, and if the test
does not contain an undue number of items requiring basic knowledge most
of them do not have. For example, STI for solving inequalities is more
likely to have a meaningfully large effect on SAT-M scores to the extent
that (1) the test has many items calling for this ability, (2) many
students have not routinely learned this ability, and (3) many examinees are
unaware of the fact that such items are included in the SAT-M.

Score component B-2 (general TW; test familiarity, appropriate pacing,
understanding general directions, knowing when and how to guess, etc.) is
susceptible to STI effects almost entirely to the extent that
the examinee is initially test-naive. Thus, for the most part, any score
increase due to STI directed to component B-2 is evidence of having helped
students receive the credit to which they are due, rather than having
fostered any kind of "beating the test" resulting in "excessive" score
gains. Note that gain of this sort is an increase (rather than an
"inflation") ". . . of the student's test score without improving the
underlying ability" that need not imply that "the student may simply gain
admission to a college where his probability of doing successful work is
low" (Coffman and Neun, 1966, p. 1).

The next score component, B-3 (specific TW; similar to general TW but
referring to item format and other item characteristics), is the only
component that poses a problem regarding possible "excessive" gain from
STI. The problem arises in the case of complex item formats which, in
their complexity, tap a kind of methods variance conceptually independent
of the mathematical or verbal aptitude the SAT is intended to measure.
Vernon (1954), in reviewing the British literature on coaching, concluded
that more complex item formats are likely to be more coachable. Loret
(1960), in his review of SAT content from the test's inception in 1926 to
1960, made the same observation, and noted a steady trend in both the
mathematical and verbal parts of the test toward simpler, more
straightforward item formats. Nevertheless, for pragmatic reasons there
remain in use item formats that are sufficiently complex to allow an
undesirably large STI effect favoring students who are given help in
learning how to deal with the item format complexities. Aside from dropping
such item formats altogether, the problem can be reduced in the following
ways: (1) by keeping the number of such items proportionally low; (2) by
imposing appropriate test specifications within item format (cf. Willner's
suggestion, noted earlier, for minimizing the role played by word
association in solving verbal analogies); (3) by expanding and clarifying
directions given within each test; and (4) by providing wide dissemination
of information describing these item formats and instruction about how to
cope with their complexities. To the extent that these four measures are
taken, the magnitude of STI effects for component B-3 will tend to fall
within acceptable limits.

Although the final pair of score components, C-1 (level of confidence)
and C-2 (level of efficiency), are in large measure spin-offs of the
preceding seven, STI may include direct attempts to ensure that these benefits
do indeed follow from instruction directed to the other score components.
Here again it may be noted that even instances of large score gains
resulting from changes in the score components in question are instances of
helping examinees to receive appropriately higher scores, rather than
helping them make excessive gains that might be both unfair and a
disservice to the examinees by making their scores unrealistically high.



FINDINGS REGARDING THE SAT-M

A basic discrepancy. In summarizing the findings of studies of STI or ITI
for the mathematical sections of the SAT we begin with a basic discrepancy.
The overall conclusions of Dyer, French, Dear, Lass, Frankel, Whitla, and
Roberts and Oppenheim are essentially negative, whereas those of Pike and
Evans, McCarthy, and Marron are positive. The former studies show overall
average score changes attributable to STI ranging from slight losses to
gains up to about 20 SAT-M scale points. Some of these differences were
statistically significant, but none were considered meaningfully large. By
contrast, overall STI effects in the Pike and Evans study are gains
conservatively estimated at about 33 SAT-M points, and those in McCarthy's
1975-76 data at about 41 points. Overall instructional effects averaged by
Marron over 10 preparatory schools yielded a gain of about 79 SAT-M points.
Among the other studies in this review, average control group gains ranged
from t5 points in Dyer's study to 66 in Frankel's, and there was a median
gain of 31 points. Using the latter as a rough estimate of what control
subjects might have gained in the Marron study, the effect of ITI would be
estimated as 79 minus 31 equals 48 points.

Interpreting the discrepancy. In considering the discrepancy between
studies in which positive overall results were reported and those in which
negative results were reported, we may consider how to interpret the
discrepancy, how seriously to take the positive results, and then examine why
the discrepancy was observed. In interpreting the discrepancy, it should
be recalled that in principle even a single study showing substantial
effects cannot be refuted by any number of studies failing to do so. It
follows, of course, that mixed results across studies cannot be dismissed
as simply indicating that meaningful effects were somehow due to
happenstance, or that mixed results indicate basically no effect over the set
of studies summarized.

Credibility of the positive findings. On the other hand it is reasonable
to demand of a study that obtains positive results contrary to most other
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actually done, and that the limits of possible score gains linked to the
several score components should also be considered.

Applying the score components model. Having considered the general
outcomes of instruction for the SAT-M, and the questions that arose from a
disparity between those studies failing to show an STI or ITI effect
and those succeeding in doing so, we may next consider the findings not as
overall outcomes but as outcomes related to the several score components.

Consider first the four score components having to do with developed
mathematical aptitude. It will be recalled that the three components
subject to STI effects (A-2, A-3, and A-4) were generally present in the
studies that demonstrated STI or ITI effects and absent in those that failed
to do so. We have also noted that limitations in component A-1 (developed
ability) may have contributed to the lack of STI effects in the Roberts and
Oppenheim study, where the gap to be bridged may have been simply too large
for STI to have an effect. On the other hand, the Pike and Evans instruction
was effective over a wide range of initial SAT-M scores. The Frankel data
showing control subject gains of 66 points provide an instance of
substantial growth in component A-1. This is an interesting discovery not
only because of its magnitude but also because for most students this is
apparently the effect of studying advanced levels of high school
mathematics (most students in the school take four years of mathematics).
These findings suggest that although the mathematics required to answer
SAT-M items is intentionally limited to ninth- or tenth-grade content,
mathematics beyond that level serves not only as review but also to
facilitate answering SAT-M items. This in turn suggests that for mathematics
the aptitude-achievement distinction is relative and implies as well that
one way to increase mathematical aptitude as measured by the SAT-M is to
take additional courses in that subject area.

The importance of component A-2 (review) is supported by data in three
of the studies that reported no meaningful overall STI effects. The studies
by Dyer, French, and Dear all showed STI gains of 28 or 29 SAT-M points for
examinees not currently studying mathematics but much smaller gains for
those who were taking mathematics courses. Some support for the possibility
that instruction for components A-3 (integrative learning) and A-4 (analytic
skills) may lead to a subsequent increased rate of growth in mathematical
reasoning ability is provided in the Pike and Evans study, where it was
observed that participants not only gained between pretest and posttest but
gained an average of 24 additional points between the posttest and the
post-posttest that was taken four months later.

We may next examine STI or ITI effects related to the three score
components that have to do directly with test taking. This instruction is a
kind of "teaching to the test," but as noted earlier its impact is to help
students overcome test-specific obstacles that cause them to receive
inappropriately low test scores. Component B-1 (the match between an
examinee's developed ability and the test content domain) was addressed as
part of the content review in the Pike and Evans study, and presumably in
those of Marron and McCarthy as well. It would be desirable to use diagnostic
test information as well as item content information in those and in future
studies to see whether filling specific gaps such as computing averages and
solving inequalities has a demonstrable effect.

All studies presumably gave at least some attention to component B-2
(general TW). If, however, there is any strong conclusion to be reached
from the studies reporting no meaningful STI effects, it is that
instruction for general TW in the form of a few general rubrics such as "use
your time well," "answer if you think you know the correct choice or if you
can eliminate at least one alternative," and of loosely structured group
practice and discussion sessions is quite consistently ineffectual. This of
course has direct implications regarding the probable value of much of
the commercially provided test coaching. The aspect of general TW given
particular attention in Pike and Evans was that of knowing when and how to
guess, given partial information. It was found that there was considerable
confusion on the part of students and teachers alike on questions of the
scholastic propriety, fairness, and efficacy of guessing when partially
informed, and a related confusion regarding the implications of the formula
score that "corrects" for guessing by subtracting a fraction of a point for
wrong answers. Classroom demonstrations of the results of guessing when
there was no information, and again when either two or three of five choices
could be eliminated, allowed students in each class to derive the conclusion
that over a set of items "partial credit is given for partial information."
This component of TW should also be examined for its effect on test-taking
behavior and on test scores. Component B-3 (specific TW), particularly for
the relatively complex item formats (data sufficiency and quantitative
comparison), was also given considerable attention in the Pike and Evans
study, which was probably responsible for the greater STI effects observed
for these formats than were found for the much simpler "regular
mathematics" item format.
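
The classroom demonstration of guessing described above can be sketched numerically. The example assumes five-choice items and the conventional formula-score penalty of 1/(k - 1) of a point per wrong answer on a k-choice item (one-fourth of a point for five choices). The text says only that "a fraction of a point" is subtracted, so the penalty value and the function name are assumptions made here for illustration. The sketch also assumes that every eliminated choice is in fact wrong and that the examinee then picks at random among the remaining choices.

```python
from fractions import Fraction


def expected_formula_score(n_choices=5, n_eliminated=0):
    """Expected formula-score credit per item from guessing at random
    among the choices left after correctly eliminating `n_eliminated`.

    Assumes +1 for a right answer and -1/(n_choices - 1) for a wrong one.
    Hypothetical helper for illustration only.
    """
    remaining = n_choices - n_eliminated
    if n_choices < 2 or remaining < 1 or n_eliminated < 0:
        raise ValueError("invalid number of choices or eliminations")
    p_right = Fraction(1, remaining)
    penalty = Fraction(1, n_choices - 1)
    return p_right - (1 - p_right) * penalty


# The three cases named in the text: no information, then two or three
# of the five choices eliminated.
for eliminated in (0, 2, 3):
    value = expected_formula_score(5, eliminated)
    print(f"{eliminated} eliminated: expected credit per item = {value}")
```

Under these assumptions blind guessing has an expected value of zero, while eliminating two or three choices yields an expected 1/6 or 3/8 of a point per item, which is one way of seeing that, over a set of items, partial credit accrues to partial information.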

Confidence and efficiency (components C-1 and C-2) in test taking are
most likely to increase if substantial efforts on the earlier score
components have been made. Thus, in the three studies involving content
instruction, it is very likely that at least some gains attributable to the
secondary effects of increased confidence and efficiency in test taking
also occurred. To enhance this effect, Pike and Evans incorporated
occasional timed practice tests that were tailored to the instruction
previously received, in order to provide the students an awareness of having
increased their test-taking capabilities.

FINDINGS REGARDING THE SAT-V

Would finding STI effects be feasible? It is a common observation that
verbal aptitude is not likely to be as subject to coaching or STI effects
as is true of mathematical aptitude. In the preface to the Pike and Evans
(1972) monograph, for example, Kendrick stated that: "By now it has been
fairly definitely settled that the verbal part of the Board's Scholastic
Aptitude Test (SAT) is impervious to coaching. The mathematical part seems
similarly, though perhaps not so thoroughly, proof against special
preparation, but the question of mathematics is complicated by the fact that
some students do not take mathematics in their senior year of secondary
school, and lead lives very nearly undisturbed by quantitative thought.
For them, it is only reasonable that a little review or warming-up
would be helpful. . . ." (p. v). We have noted earlier (page 25) the
College Board Commission on Tests' statement that concludes, ". . . if
verbal and mathematical aptitude, especially verbal aptitude, can be
developed within the length of, say, a school year, no one has yet
demonstrated a way to do it" (emphasis added).

The above considerations, and the important role mathematics content
instruction appeared to have in the studies showing meaningful STI gains
for the SAT-M, make any study purporting to show major SAT-V gains appear
suspect. However, a comparison of SAT-M and SAT-V findings among seven
studies reporting coaching/STI effects on both (Dyer, French, Lass, Dear,
Frankel, Whitla, and Roberts and Oppenheim) can serve to check on this
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this is not a large gain on the overall SAT-V score, it would be a
meaningful effect as a component of that score, limited to the analogies part
of the test. The Whitely and Dawis study showed gains quite consistent in
degree with those reported by Moore, the difference being that the group
studied were inner-city high school students rather than graduate students.

Interpreting the discrepancy. Confronted again with mixed results across
studies, we note once more the logical primacy of studies demonstrating
effects over those failing to do so, but also reiterate that if they are
to be fully accepted, positive STI conclusions must be strongly supported
by research design and data. In this respect there are shortcomings in
both studies addressed directly to raising SAT-V scores. The Pallone STI
findings are based on only 20 experimental subjects, a number small enough
to indicate clearly the need for replication before great confidence can
be placed in the findings. Furthermore, both the STI and the ITI effects
were observed in a single school, and the lack of control subjects leaves
open the question of how much of the observed gain was attributable to the
programs of special instruction and how much to other factors operating in
the school in question. The fact remains, however, that the gains were
extraordinary. Control subject gains on the SAT-V in the superior schools
studied by Lass, Frankel, and Whitla were 41, 38, and 39 points
respectively. If we then estimate that school effects and any other
sources of growth and practice in the Pallone school would ordinarily be
about 40 points, the average gains attributable to STI and ITI (involving
80 students) would be 58 and 82 points. Using the same estimation of
expected control subject gains, Marron's data would indicate SAT-V gains
of about 17 points. Thus the students appeared to make gains only slightly
greater than they could have expected from attending an exceptionally good
high school over the same period of time. Even though the average gains on
the SAT-V for students taking the SAT in April or May of one year and
again in December or January of the next are usually in the neighborhood
of 15 to 25 points, the 57-point SAT-V gain observed for the Marron study
is large enough at least to raise doubts about the Commission on Tests'
statement about raising verbal aptitude scores within a year, particularly
since 10 different schools were involved.
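
The arithmetic behind these estimates can be sketched in a few lines of
Python. This is only an illustration under the paragraph's own assumption
that about 40 points of gain would be expected without special
instruction; the function and variable names are hypothetical and do not
come from the studies themselves.

```python
from statistics import mean

# Control-group SAT-V gains (in points) reported for the schools
# studied by Lass, Frankel, and Whitla.
control_gains = {"Lass": 41, "Frankel": 38, "Whitla": 39}

# The text rounds the average control gain to "about 40 points".
expected_gain = int(round(mean(control_gains.values()), -1))


def gain_attributable_to_instruction(observed_gain, expected=expected_gain):
    """Observed gain minus the gain expected from growth and practice.

    Hypothetical helper: it assumes the control-school average is a fair
    estimate of what the instructed students would have gained anyway.
    """
    return observed_gain - expected


# Marron's observed 57-point SAT-V gain, net of the expected gain.
marron_net = gain_attributable_to_instruction(57)
print(expected_gain, marron_net)
```

The subtraction is only as good as the assumed baseline: with no control
subjects in the same school, the 40-point figure is borrowed from other
schools rather than measured.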

Jacobs' findings of gains ranging from 44 to 77 points on the ECT are
not only substantial, particularly as they were obtained with only 18 hours
of instruction, but are also impressive in the sense that the research
design was strong, with random assignment of subjects to control or
experimental groups. The question is whether these findings on an
achievement test can be interpreted as relevant to the SAT-V, an aptitude
test, particularly since achievement tests with their content orientation
are generally considered to be more susceptible to STI. Arguing for the
relevance of Jacobs' findings are three observations: (1) achievement
tests such as the ECT are viewed as becoming increasingly more like
aptitude tests as efforts are made to have questions that will generalize
across many school curriculums; (2) parts of the SAT-V, particularly
antonyms and reading comprehension items, are measures of vocabulary and
reading ability that could as well be viewed as achievement measures; and
(3) the ECT contains complex item formats, and results of instruction for
coping with these complexities may have implications for possible
vulnerability of complex SAT-V item formats (particularly analogies) to
similar kinds of instruction.

Marron's finding of an 83-point gain on the ECT for some 350 students
serves primarily as a rough confirmation of Jacobs' findings, although ITI
was required to do it. Moore's data must be considered as tentative, in
part because there were only 19 experimental and 19 control subjects, and
in part because the item format used was rather more cumbersome than that
employed in the SAT-V. The Whitely and Dawis study involved 184 students
from two high schools, giving an adequate data base from which to work,
and in addition it involved a rather sophisticated experimental design.
The major question about the generalizability of their data to the SAT-V is
that the researchers went to considerable lengths to keep all the analogy
items in the study at an unusually low vocabulary level. Although the
vocabulary load is also kept reasonably low in analogies used in the SAT-V,
many of the more difficult items involve fairly difficult words in order
to test the ability to recognize subtle relationships. It may well be that
the Whitely and Dawis study is directly relevant to possible score changes
for very low-scoring subjects on the analogies part of the test, but this
would have to be established in further studies.

Explaining the discrepancy. The next question is why the differences in
overall conclusions may have occurred. In most of the studies for which
negative conclusions regarding STI were reached, the instruction tended to
be brief, relatively uncontrolled, and not directed toward verbal
abilities, although an emphasis was placed on individual or group practice
in test taking. The Coffman and Neun study departed somewhat from this
pattern in that it was designed to determine the effect of a presumably
typical accelerated reading course on SAT-V scores. This course involved
about 50 hours of instruction as part of a college-credit course
emphasizing rapid reading with relative accuracy. The Pallone STI study
was comparable to the first seven negative studies in the number of hours
spent on instruction; his ITI study was comparable in hours of instruction
to the Coffman and Neun investigation. The difference in results thus
appears not to lie in the number of hours of instruction. The Pallone
instruction for both STI and ITI differed sharply from that given in any
of the negative studies, in that Pallone's instruction was deliberately
designed to go beyond the "coaching" that had so regularly been found
ineffectual. Instead, the instruction focused directly on the reading,
vocabulary, and verbal reasoning abilities that the SAT-V is intended to
measure. The program was highly systematic and controlled, involving
instruction in intensive reading, skimming, critical reading, exercises in
answering reading comprehension items, and solving verbal analogies.
Marron's study was characterized by the large amount of time involved (a
full semester directed expressly to raising selected test scores),
although the amount of time devoted to preparation for the SAT-V is not
clear. In any event, some kinds of verbal content instruction can be
assumed and perhaps instruction directed to specific item formats as well.
The studies of instruction for analogy solving (Moore; Whitely and Dawis)
are not necessarily inconsistent with results in the studies reporting no
meaningful overall gains on the SAT-V. This will be given further comment.

As was true for the Pike and Evans study of SAT-M instruction, the 
Pallone study of STI and ITI for the SAT-V differed most markedly from 
the others yielding negative results in the degree to which instruction 
was substantive and controlled, with emphasis given to effective review 
(A-2), integrative learning (A-3), the teaching of relevant analytic skills
(A-4), and instruction specific to item format characteristics (B-3). On
the one hand, this suggests that the generalizability of the Pallone
results is limited to STI or ITI efforts that have a similarly strong
content orientation, and perhaps specific TW instruction as well. On the
other hand, these characteristics of the Pallone instruction clearly fall
within the sphere of STI and ITI questions raised in various College
Board and other statements on these topics, and by student, parent, and
professional education organizations. Generalization from the Marron
results for SAT-V instruction is limited by the recognition that a
considerable amount of instructional time was required to obtain the gains
reported.

It is interesting to compare the importance of instructional content and 
the amount of instruction as they affect SAT-V scores. This is most evident 
in comparing the Pallone study to that of Coffman and Neun. The considerable 
amount of time spent in developing reading skills in the Coffman and Neun
study yielded trivial gains and even losses in SAT-V scores, whereas the
sharply focused curriculums used in STI and ITI in the Pallone study yielded
sizable score gains. This difference between comparatively passive,
unfocused study and active study directed to specific skills runs counter
to the common feeling represented by French and Dear's (1959) conclusion
that, rather than seeking coaching, "an eager College Board candidate
. . . would probably gain at least as much by some review of mathematics
on his own and by the reading of a few good books" (p. 329).

Item format differences. Only one of the 10 studies of SAT-V instruction
reported differences by item format. This may have been in part because not
many items of any one kind were present, since four item types were used, 
thus making comparisons risky, and in part because in most if not all the 
studies attention was focused on the overall results. This is unfortunate, 
because there is good reason to believe that because of differences in 
format complexity some item types may be more susceptible to instruction 
than others. In the French study, SAT-V instruction was provided in only
two of the three schools. In the first school, two-thirds of the 18-point 
gain attributed to instruction was observed for analogies. For the second 
school, in which a 5-point gain was observed, nearly all the effect was 
due to antonyms. The difference between the two schools is perhaps best 
attributed to differences in instruction, the latter not having been 
closely monitored or controlled. In any event, the analogies effect
noted in the one school is consistent with general evidence regarding the 
relationship between STI effects and item complexity, and with the studies
of Moore and of Whitely and Dawis that were directed specifically to 
verbal analogies. 

FINDINGS REGARDING TW

Defining TW. Again, TW will be defined as the set of skills and knowledge
about how to take a particular test that allows the individual to display 
his or her abilities to the best advantage. Implicit in the definition is 
the recognition that some aspects of TW must be used if the examinee is to 
receive proper credit for the knowledge or ability being tested, but that 
other aspects of TW, such as taking advantage of "specific determiners,"
may allow the examinee to receive more credit than is appropriate. The 
latter aspect, however, is likely to be at a bare minimum for profession- 
ally developed tests such as the SAT. 

Guessing. It was noted above that guessing, which may be defined as
answering a test question in the absence of certainty as to the correct 
response, usually involves either a more or less spurious hunch or feeling, 
or the use of partial information, and is seldom the sort of blind selec- 
tion that often first comes to mind when the term is used. It was also 
noted that partial information situations in which guessing is an appro- 
priate behavior are necessarily a part of most objective testing, particu- 
larly when the test is at an appropriate level of difficulty. 

Considerable thought and research have been given to the question of 
whether to use a "correction formula" to compensate for individual
differences in guessing tendencies. The results are far from conclusive.
Arguments for and against the use of correction formulas were also given
earlier. The main conclusions to be drawn from these are that: (1) more
information is needed on the subject to resolve differences in findings
and conclusions; (2) better within-test or before-test answering
instructions may be needed (Lord, 1975); and (3) both "rights only" and
"correction formula" scoring procedures pose answering dilemmas to
examinees, with the former emphasizing the decision of how to select an
answer and the latter emphasizing that of whether to select an answer
when in doubt.
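
The two scoring procedures can be contrasted with a short Python sketch.
It assumes the conventional correction-for-guessing formula, rights minus
wrongs divided by one less than the number of choices, which the passage
refers to but does not spell out; the function names are hypothetical, and
the guessing model (a uniform choice among the options not yet eliminated,
one of which is correct) is a simplifying assumption.

```python
from fractions import Fraction


def rights_only_score(n_right):
    """Rights-only scoring: wrong answers and omissions both count zero."""
    return Fraction(n_right)


def formula_score(n_right, n_wrong, n_choices=5):
    """Conventional correction: R - W/(k - 1); omitted items are ignored."""
    return Fraction(n_right) - Fraction(n_wrong, n_choices - 1)


def expected_gain_per_guess(n_remaining, n_choices=5):
    """Expected formula-score change from one guess, versus omitting.

    Assumes the examinee picks uniformly among n_remaining options and
    that the keyed answer is one of them.
    """
    p_right = Fraction(1, n_remaining)
    return p_right - (1 - p_right) * Fraction(1, n_choices - 1)


# Blind guessing among all five choices neither helps nor hurts on average.
blind = expected_gain_per_guess(5)
# Eliminating two choices first makes guessing pay off on average.
partial = expected_gain_per_guess(3)
print(blind, partial)
```

Under these assumptions the dilemma in conclusion (3) becomes concrete:
with rights-only scoring any guess can only help, so the question is how
to choose, whereas with formula scoring the value of a guess depends on
how much partial information the examinee has.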

Risk taking (RT). Individual differences in guessing tendency at a
given level of uncertainty of the correct answer and under a given set of
instructions about guessing may be described as differences in risk-taking
(RT) behavior. One set of basic findings reported above regarding RT was
that of Swineford and Miller (1953), who studied RT under instructions 
that encouraged, discouraged, or were neutral to guessing. They found that 
(1) there was some guessing under all three sets of directions, (2)
instructions inhibiting guessing were more effective than those encouraging
it, and (3) there was little relationship between RT (deciding when to
guess) and ability. Another basic finding was that of Crehan, Koehler,
and Slakter (1974), who found that an individual's RT tendency is
relatively stable across different tests at a given time, but that RT tends to
decrease over grades 5 to 9, then becomes relatively stable, at least 
through grade 11. They noted the implication of this finding, that the 
contribution of RT strategy toward maximizing test scores actually tends 
to become less between grades 5 and 9. 

Answer changing. The question of whether to change test answers moves
from the question of when to guess, toward that of how to do so. Excellent 
summaries of studies of answer changing are found in Lynch and Smith 
(1975), and in Mueller and Wasser (1977). Some of the conclusions generally 
agreed upon are listed on page 18.

TW for reading comprehension items. In considering TW as it applies 
specifically to particular item types, the shift from when to guess to how 
to guess is particularly evident. In a comparison of free-response and 
multiple-choice testing of the reading comprehension of British and Amer- 
ican examinees, Vernon (1962) found a test-sophistication factor in the 
multiple-choice responses of British examinees, who were generally
unfamiliar with such tests, that was much less evident in American
responses. The difference was more pronounced for the relatively complex
reading comprehension items than for the more straightforward vocabulary
questions. This would suggest a relatively greater need for TW instruction
for students on the more complex item format, reading comprehension. Two
strategies were observed by Pyrczak (1972, 1974) in studies of answering
behavior when the reading passages were not available. One made use of
interrelationships among the items in a given set that accompanies a
given reading passage, and another used such devices as selecting general
principles rather than specific facts. 

TW for verbal analogies. Connolly and Wantman (1964) used "think aloud"
procedures with nine subjects and provided an impressionistic report of
analogy-solving processes. One conclusion was that words among the
alternative choices influenced how the stem words were interpreted. Another
was that the students differed considerably in their methods of solving the
analogy problems. These observations suggest the need for instruction
directed to score components A-4 (relevant analytic skills) and B-3
(specific TW).



Other studies have examined the relationship between word associations
and the solving of verbal analogies. Willner (1964) demonstrated that on
many verbal analogies (drawn from a variety of tests other than the SAT),
nearly half the items could be answered correctly using word associations
alone, i.e., without having first to educe the relationships for a given
item and then solve the analogy on the basis of the educed relationships.
He noted that in some instances word associations tended to hinder, rather
than facilitate, solving particular analogies, and thus the opposite effect
seemed to have occurred. His proposed solution to the problem is to
construct analogy items that are substantially free of the word association
effects. This seems clearly desirable, since the use of facilitative word
associations to get a higher score will give some students an unfair
advantage, and the susceptibility to the distracting power of other word
associations will put test-naive students at a disadvantage. Even if the
two effects were well balanced across a set of items, the problem remains
that some meaningful part of score variance will occur because of this
factor, rather than because of examinees' relative ability to solve verbal
analogies, i.e., to educe and subsequently use structured relationships
between pairs of words. Another way of reducing the problem is to provide
instruction in solving analogies. It may be that simply expanding the
within-test directions to include one sample item and its solution would
be adequate.

Recommendations for Future Research



Specific recommendations for research on short- and intermediate-term 
instruction for the SAT, testwiseness, and related topics will be
preceded by a discussion of the objectives toward which the research would
be directed and a discussion of general research design considerations
derived from an evaluation of the studies reviewed in this survey.

RESEARCH OBJECTIVES

The immediate objectives of the research to be recommended will be
presented after discussing the ultimate objectives toward which these would
be directed.

Ultimate objectives. There are three ultimate objectives toward which
the research would be directed. The first is to maximize the fairness
and validity of the SAT with regard to its short-term and
intermediate-term instruction (STI and ITI) score components. The second is
not to discourage concern and activity regarding test-preparedness, but
rather to foster realistic understanding and expectations regarding
possible outcomes of STI and ITI. The third, which would derive from the
pursuit of the first two, is the emergence of a more basic understanding of
the processes involved in test taking and contributing to aptitude test
scores.

In considering these objectives, the score components model will again
serve as the organizing principle. Differences in component A-1, aptitudes
that have developed over a long period of time, do not fall within the
purview of this survey, because the question of special instruction for the
SAT, whether as STI or ITI, is by definition excluded from consideration
for that component. The final component in the model (D-1, error variance)
is also excluded by definition, since "error" as used here in its
traditional psychometric sense is score variance not attributable to the
factors being considered. The remaining eight components are all subject to
various STI and ITI effects and as such are those with which we will be
concerned.

The issue of test fairness, to which the first research objective is 
addressed, is necessarily raised if there are meaningful STI and ITI score 
effects because of differences in the availability of instruction and even 
in the awareness of its possible effects. The fairness of the SAT with
respect to the effects of special instruction can be maximized in four
ways. The first is by informing examinees and educators of STI components
that may increase academic aptitude performance (as distinct from
underlying academic competence). These instructional components correspond
to score components A-2 (review), A-3 (integrative learning, overlearning),
and A-4 (learning relevant analytic skills). The second way is by
minimizing the

ing such instruction. Appropriate emphases and expectations may be
encouraged by informing students and educators of the appropriateness of
STI where needed for score components A-2, 3, and 4 (increasing scholastic
aptitude performance), B-1 (filling in important gaps in assumed knowledge
or skills), B-2 and 3 (providing general and specific TW), and C-1 and 2
(helping examinees develop test-taking confidence and efficiency).
Inappropriate STI or ITI emphases, expectations, and activities may be
discouraged primarily by informing students and educators of the limitations
of special instruction corresponding to the components of test scores and 
to related examinee characteristics. This could begin by noting that for 
most students score component A-1 is by far the largest and is by defini- 
tion not subject to STI effects, and by noting that because of component 
D-1 (error variance) any program of STI or ITI may result in a certain 
percentage of substantial score gains that are attributable entirely to 
chance and do not, therefore, constitute bona fide evidence of STI effects. 
Attention can then be directed to those score components that may be
influenced by STI but for which such effects are necessarily subject to
strong limitations. The effects of STI addressed to component A-2 (review), 
for example, are subject to two limiting factors. First, "review" pre- 
supposes that relevant material had already been learned earlier, and 
second, the effects for a given examinee are necessarily limited by the 
extent of his or her need for review. Similar limitations related to 
"readiness" for STI and the degree of need for it apply to components A-3
and 4. STI effects for score component B-1 are limited by the number of
test items calling for the required knowledge or skill, and by the degree
to which the examinee is lacking in these skills. Components B-2 and B-3
are limited by the degree of test naivete to be overcome, as well as the
need for TW that the test imposes. On the latter point, for example, an
examinee who can answer most test questions correctly with confidence has 
little need for an effective strategy for guessing on the basis of partial 
information; the converse is true, of course, for the examinee who is only 
partially informed on a large percentage of the test questions. STI for 
components C-1 and C-2 is limited in its effects primarily by the extent
to which examinees are being handicapped by a lack of confidence and
efficiency in test taking.

Perhaps the clearest instance of STI limitations that can be pointed
out is instruction consisting almost entirely of drill on sample test
questions. Such instruction is not only academically unsound but misses
most of the avenues for having a meaningful effect on test scores. It
entirely bypasses components A-2, A-3, A-4, and B-1 and deals only
peripherally with the TW components B-2 and B-3. Furthermore, it is
unlikely to have more than a very modest effect on C-1, since confidence
can best be built on a realization of increased competence in coping with
the informational and TW requirements of the test, or on C-2.

Before leaving the topic of the limitations of STI, it is useful to
address the paradox that despite these limitations, an examinee's problems
with one or more of the score components may be such that appropriate STI
could result in a very large score gain. The resolution of the paradox is
in the realization that the limits are in the form of a "ceiling" effect,
but that there is no equivalent "floor" effect. For example, some
examinees may be so lacking in test-taking confidence that they "bomb" on
the SAT, and remedying this may appropriately result in meaningfully large
score gains. However, the ceiling effect is such that: (1) the STI cannot
yield a test score higher than that warranted by developed aptitude; and
(2) such large gains can only occur for those examinees who were initially
severely handicapped by poor test-preparedness. To put it more generally,
those students who are already well test-prepared will have little gain
from STI, whatever its quality and duration, their own levels of
motivation, and so on.

The third and ultimate objective for the recommended research would
be derived in the process of realizing the first two. This objective is
to gain a more basic understanding of the processes involved in test
taking that contribute to the test scores. Such an understanding can
provide a good foundation for a possible evolution in aptitude testing,
and could assist in providing information for diagnostic and placement
purposes, rather than for admissions decisions only.

Immediate objectives. Recommended immediate objectives of future
research regarding STI and TW would be to study systematically the effects
of STI (or ITI) directed to the several components of SAT-M and SAT-V test
scores, taking into account selected characteristics of examinees, test
items, and special instruction. Examinee characteristics of most direct
interest would be those related to the several test score components.
Examples of these, for which measures before and following STI would be
desirable, are as follows. For components A-2 and A-3, the level of
mastery of skills such as computing ratios and proportions; for A-4,
observations of item-answering processes, and facility in locating required
information by scanning reading passages; for B-1, measures of information
and skills (such as understanding the test directions) assumed in those
taking the SAT; for B-2 and B-3, degree and kinds of TW and test-naivete,
including those involved in guessing behavior; and for C-1 and C-2, indices
of levels of confidence and efficiency in test taking. Among the item
characteristics of interest would be item format, difficulty, fineness of
distinction between the distractor and the keyed choice, and so on.
Characteristics of the STI would include the instructional materials used
and the conditions under which they were used. For score components A-2, 3,
and 4, the use or nonuse of such materials as mathematics review and
vocabulary building textbooks would be of interest. Similarly, instruction
for components B-1, 2, and 3 might be examined for differences associated
with the use or nonuse of test familiarization materials more or less
resembling the SAT descriptive booklets. For components C-1 and 2, effects
resulting from taking a practice test, particularly one under conditions
closely paralleling the SAT (such as the PSAT), would be of interest.
Whatever the instructional materials, other variables of interest would be
whether the STI was undertaken alone, through a tutor, in a more or less
typical classroom setting, or in commercial coaching sessions or their
equivalent.

DESIGN CONSIDERATIONS 

The bulk of the studies reviewed in this survey have contributed little
toward providing information that is either broadly generalizable or
cumulative. This may be attributed in part to the considerable complexities
of the questions involved, in part to the provision of STI that was loosely
structured and monitored, with little certainty of exactly what was
provided, and in part to a tendency to consider the results in an overall
way with scant attention paid to systematic differences among score
components, examinees, test items, and instruction. It would seem
important, therefore, that future research on STI and TW as they apply to
the SAT should be given strong design consideration. This does not mean, of
course, that pilot studies should be excluded, nor does it mean that only
large-scale, costly studies should be undertaken. What it does mean is that
future studies of any magnitude should be tightly designed and fit into a
matrix of interrelated studies that will collectively shed considerably
more light on these questions and give more information than is now
available. These studies, directed to the objectives presented above, can
be given strong design characteristics by partitioning between STI effects
and practice and growth, by partitioning among score components and
examinee, item, and instructional characteristics, and by the systematic
gathering and use of detailed pretest and posttest information.

Pretest and posttest data should be collected, but not simply for
overall test scores averaged over all examinees. For a given study,
contrasts in STI effects associated with item format or with other item
characteristics may well be desirable. Such contrasts should be designed
into the study, with consideration given to adequate sampling of an item
pool for each category of items that is of interest, and in particular they
should have a large enough number of items in each category for meaningful
and stable score differences to be demonstrated where appropriate. Pretest
and posttest measures on examinee variables are desirable across a fuller
array of score components than have generally been used. This is
particularly true for such aspects of TW as guessing strategies and RT
tendencies. Ideally, a test-preparedness profile over the several score
components could be of great value.

RECOMMENDATIONS FOR SAT-M RESEARCH 

The Pike and Evans study of instruction for the SAT-M demonstrated very
clearly that meaningful STI effects, both overall and differentially by
item format, can be obtained if the instruction is well designed and if it
covers a sufficient range of SAT-M score components, including mathematical
content (A-2, A-3, and B-1) and general and specific TW (B-2 and B-3).

Recommended future research would be in the form of a coordinated series of 
studies, replicating that of Pike and Evans but differing primarily in an 
emphasis on partitioning the effects according to selected examinee, item, 
and instructional characteristics as they apply to selected test components.

The outcome of a set of such studies would include: (1) an extension of
the study to effects observed for inner-city students or others likely to
be at lower levels of developed mathematical aptitude; (2) essentially a
replication, but dropping data sufficiency items and differentiating
between the content score components (A-2, A-3, and B-1) and the TW
score components (B-2 and B-3). The use of diagnostic pretests and
posttests of basic content skills and knowledge, and of TW abilities and
attitudes, would be an integral part of the study design. Over the set of
studies there would also be differentiation based on selected item
characteristics within item format, and differentiation between instruction
provided by self-study and instruction provided under classroom supervision.

There are, of course, many ways of dividing the work into a set of
related studies. Among the possible studies that would seem most to warrant
current consideration are the following: (1) an extension of some part of
the Pike and Evans research to different examinee groups, in particular
inner-city students or students identified as highly "math anxious"; (2) a
study focused primarily on TW score components (it is in this realm that
the questions of fairness and test validity are most problematic); and (3)
a study directed primarily to relevant mathematical content. Either of the
last two could be profitably expanded or followed up by a companion study
that would allow a comparison between self-study and study provided in the
classroom.

RECOMMENDATIONS FOR SAT-V RESEARCH 

There is no SAT-V counterpart to the Pike and Evans SAT-M study upon which
to build a series of subsequent studies. There are, however, two studies
that can serve in conjunction with the research objectives and design
considerations already outlined to give some direction to future SAT-V
research.

In the Pallone study, instruction was carefully designed and monitored 
and was directed to the full array of STI or ITI score components. As with
the Pike and Evans study of the SAT-M, the resulting score gains were both
pragmatically and statistically significant. The fact that only a few
subjects in a single school were involved, with no control subjects,
severely limits the generalizability of the findings and fails to allow a
partitioning between instructional effects and those attributable to growth
and practice.

The Whitely and Dawis study was limited to STI for verbal analogies.
Within that constraint its strong design makes it a good study upon which
to base some of the decisions for subsequent research plans for the SAT-V.
The subjects were 184 students randomly drawn from two high schools and
randomly assigned to treatment and control groups. The instruction was
carefully designed and administered, and in the sense that the underlying
skill of educing relationships is one that can be taught and may be
considered a content skill, the instruction covered both content and TW
score components. Resulting score gains were statistically significant and
were also pragmatically relevant to the SAT-V if STI effects for component
test scores, as well as for overall scores, are considered. The basic
limitation on extrapolating from the Whitely and Dawis findings to the
analogies part of the SAT-V is that vocabulary was kept at an unusually low
level in their study. The vocabulary requirements of SAT-V analogies of
above average difficulty often include words that are rather difficult
because they are needed to test the ability to educe more subtle
relationships.

The current status of firm information regarding possible STI effects
for the SAT-V is particularly problematic. Despite the importance of SAT-V
scores, the issues of fairness and validity tied to possible STI effects,
and the existence of a marked discrepancy between studies reporting
negative findings and those reporting meaningful SAT-V score gains, no study
exists that seriously tests the strong assertions noted above that the SAT-V
is apparently impervious to the effects of periods of special instruction
even as long as a year. As noted above, the general belief that the SAT-V
must be considerably less susceptible to STI than the SAT-M does not hold
up in a comparison of SAT-M and SAT-V effects in the several studies
directed to both. Furthermore, several years have passed since Pike and
Evans demonstrated STI effects for the SAT-M in 1972, and since the Pallone
study in 1961 reported large STI effects for the SAT-V. What appears
to be needed is a study or set of studies using STI generally similar
to that provided by Pallone but modified to meet the objectives and design
requirements described above. The effects of STI should be studied in a
manner that would allow a partitioning of results according to test score
components as they relate to specified examinee, item, and instructional
characteristics. As was recommended for the SAT-M, a partitioning between
content-related score components (A-2, A-3, A-4, and B-1) and TW
components (B-1 and B-2) would be desirable. Among examinee characteristics of
interest would be variables related to quality of education, and perhaps a
group having the verbal equivalent to "math anxiety" or "math aversion."
Most basic among item characteristic partitionings would be that of
examining STI effects separately by item format: reading comprehension,
analogies, antonyms, and sentence comprehension. In doing so, instruction
would be tailored to content-related and TW score components as manifested
by each item format. Again, the use of diagnostic pretests and posttests of
basic content skills and knowledge, and TW abilities and attitudes, would
be an integral part of the study design. Over the set of studies,
information would also be gathered comparing self-study to study in classroom
settings.